Wealth and income inequality is a major concern, especially in the United States. The likelihood of diminishing poverty is one valid reason to reduce the world's surging level of economic inequality, and the principle of universal moral equality supports sustainable development and a nation's economic stability. Governments in different countries have been trying to address this problem and find workable solutions. The aim here is to show how machine learning techniques can be applied to this problem, using the UCI Adult Dataset. Specifically, several machine learning classification models are compared on the task of predicting whether a person's yearly income in the US falls into the category of greater than 50K dollars or less than or equal to 50K dollars, based on a set of attributes. In other words, Y (>50K, <=50K) is predicted given (X1, X2, X3, …, Xn), where Y is the income level and each Xi is a statistical feature of an individual.
library(knitr)
library(ggvis)
library(ISLR)
library(e1071)
library(gmodels)
library(tidyverse)
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)
library(randomForest)
library(caret)
library(data.table)
library(gbm)
library(rpart)
library(rpart.plot)
library(plotly)
library(neuralnet)
library(MASS)
train <- fread("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data")
test <- fread("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test")
## Warning in fread("http://archive.ics.uci.edu/ml/machine-learning-databases/
## adult/adult.test"): Detected 1 column names but the data has 15 columns
## (i.e. invalid file). Added 14 extra default column names at the end.
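The warning occurs because adult.test begins with a stray first line ("|1x3 Cross validator") that is not data. One way to avoid it (a sketch, not the original code) is to skip that line when reading; this is demonstrated on a temp file shaped like adult.test rather than re-downloading it:

```r
library(data.table)

# adult.test's first line is not data; skip = 1 reads past it cleanly
tmp <- tempfile(fileext = ".test")
writeLines(c("|1x3 Cross validator",
             "39, State-gov, Bachelors, <=50K.",
             "50, Self-emp-not-inc, Bachelors, <=50K."), tmp)
test_demo <- fread(tmp, skip = 1, header = FALSE)
nrow(test_demo)  # 2 data rows, no column-name warning
```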
## (Output of preprocessing chunks not shown above: the training set has
## 32561 rows, the income classes split about 76% Under_50K to 24% More_50K,
## and no missing values were found in either train or test.)
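The chunk that assigns the UCI column names and recodes the income label into Under_50K / More_50K is not shown. A minimal sketch of that step, assuming the standard Adult column order with fnlwgt dropped, is given below; it runs on a toy stand-in (`train_demo`, with only a few of the 15 columns) rather than the real `train`:

```r
library(data.table)

# Toy stand-in for the freshly read data; the real call applies the same
# setnames()/recoding to `train` and `test` directly
train_demo <- data.table(V1 = c(39L, 52L), V2 = c("State-gov", "Self-emp"),
                         V3 = c(77516L, 209642L), V15 = c("<=50K", ">50K"))
setnames(train_demo, c("Age", "Work_Class", "fnlwgt", "IncomeCLASS"))
train_demo[, fnlwgt := NULL]  # sampling weight, not a predictor
train_demo[, IncomeCLASS := factor(fifelse(grepl(">50K", IncomeCLASS),
                                           "More_50K", "Under_50K"),
                                   levels = c("Under_50K", "More_50K"))]
levels(train_demo$IncomeCLASS)  # "Under_50K" "More_50K"
```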
P <- ggplot(train,aes(x = Age, fill = IncomeCLASS)) + geom_bar(position = "fill", color = "black") + coord_flip()
P1 <- P + labs(title = "Age vs Income Class Proportion")
ggplotly(P1)
gg <- train %>% ggvis(x = ~Age) %>% layer_histograms(fill := "green")
gg
Q <- ggplot(train,aes(x = Education, fill = IncomeCLASS)) + geom_bar() + coord_flip()
Q1 <- Q + labs(title = "Education vs Income Class Proportion")
ggplotly(Q1)
train %>% ggvis(x = ~Age) %>% layer_histograms() %>% layer_paths( strokeWidth := 1, stroke := "red")
R <- train %>% ggplot(aes(x = Work_Class, fill = IncomeCLASS)) + geom_bar(position = "fill", color = "black") + coord_flip()
R1 <- R + labs(title = "Work_Class vs Income Class Proportion")
ggplotly(R1)
We find that people employed by private companies account for the largest number of earners above 50K, while the self-employed show a higher proportion of people with income greater than 50K.
hh <- train %>% ggvis(x = ~Education_Num) %>% layer_histograms(fill := "green")
S <- ggplot(train, aes(x = Education_Num, fill = IncomeCLASS)) + xlab("Years of Education") + ylab("proportion within category") + geom_bar(position = "fill", color = "black") + coord_flip()
S1 <- S + labs(title = "Education_Num vs Income Class Proportion")
ggplotly(S1)
T <- ggplot(train,aes(x = Relationship, fill = IncomeCLASS)) + ggtitle("Relationship VS Income Class Proportion") + xlab("Relationship") + ylab("proportion within category") + geom_bar(position = "fill", color = "black") + coord_flip()
T1 <- T + labs(title = "Relationship vs Income Class Proportion")
ggplotly(T1)
jj <- ggplot(train,aes(x = Occupation, fill = IncomeCLASS)) + ggtitle("Occupation VS Income Class Proportion") + xlab("Occupation") + ylab("proportion within category") + geom_bar(position = "fill", color = "black") + coord_flip()
ggplotly(jj)
PERFORMANCE METRICS for MODEL SELECTION:
Accuracy Statistic: the proportion of test observations classified correctly. Kappa Statistic: agreement between predictions and reference labels adjusted for the agreement expected by chance; this matters here because the classes are imbalanced (roughly 76% Under_50K).
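Using the Naive Bayes counts reported below as an example, both statistics can be computed by hand from the 2x2 confusion matrix:

```r
# Accuracy and Cohen's kappa from the Naive Bayes confusion matrix counts
cm <- matrix(c(11563, 872, 1968, 1878), nrow = 2,
             dimnames = list(Prediction = c("Under_50K", "More_50K"),
                             Reference  = c("Under_50K", "More_50K")))
n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa <- (po - pe) / (1 - pe)
round(c(accuracy = po, kappa = kappa), 4)   # accuracy 0.8256, kappa 0.4638
```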
#train Naive Bayes
model_Naive <- naiveBayes(IncomeCLASS ~ ., data = train)
pred_Nb <- predict(model_Naive, test)
confusionMatrix(pred_Nb, test$IncomeCLASS)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Under_50K More_50K
## Under_50K 11563 1968
## More_50K 872 1878
##
## Accuracy : 0.8256
## 95% CI : (0.8196, 0.8314)
## No Information Rate : 0.7638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4638
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9299
## Specificity : 0.4883
## Pos Pred Value : 0.8546
## Neg Pred Value : 0.6829
## Prevalence : 0.7638
## Detection Rate : 0.7102
## Detection Prevalence : 0.8311
## Balanced Accuracy : 0.7091
##
## 'Positive' Class : Under_50K
##
CrossTable(pred_Nb, test$IncomeCLASS)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 16281
##
##
## | test$IncomeCLASS
## pred_Nb | Under_50K | More_50K | Row Total |
## -------------|-----------|-----------|-----------|
## Under_50K | 11563 | 1968 | 13531 |
## | 146.006 | 472.069 | |
## | 0.855 | 0.145 | 0.831 |
## | 0.930 | 0.512 | |
## | 0.710 | 0.121 | |
## -------------|-----------|-----------|-----------|
## More_50K | 872 | 1878 | 2750 |
## | 718.400 | 2322.753 | |
## | 0.317 | 0.683 | 0.169 |
## | 0.070 | 0.488 | |
## | 0.054 | 0.115 | |
## -------------|-----------|-----------|-----------|
## Column Total | 12435 | 3846 | 16281 |
## | 0.764 | 0.236 | |
## -------------|-----------|-----------|-----------|
##
##
summary(pred_Nb)
## Under_50K More_50K
## 13531 2750
cm_Nb <- data.frame(confusionMatrix(pred_Nb, test$IncomeCLASS)[3])
kable(cm_Nb)
| overall | |
|---|---|
| Accuracy | 0.8255635 |
| Kappa | 0.4638227 |
| AccuracyLower | 0.8196457 |
| AccuracyUpper | 0.8313648 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
# rpart decision tree
set.seed(32323)
V <- 10     # number of cross-validation folds
Reps <- 4   # number of repeats (renamed from T, which masks TRUE)
TrControl <- trainControl(method = "repeatedcv",
                          number = V,
                          repeats = Reps)
model_part <- caret::train(IncomeCLASS ~ ., data = train, method = "rpart",
                           control = rpart::rpart.control(minsplit = 5, cp = 0),
                           tuneGrid = data.frame(cp = .02), trControl = TrControl)
pred_rpart <- predict(model_part, test, type = "raw")
confusionMatrix(pred_rpart, test$IncomeCLASS)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Under_50K More_50K
## Under_50K 11803 1895
## More_50K 632 1951
##
## Accuracy : 0.8448
## 95% CI : (0.8391, 0.8503)
## No Information Rate : 0.7638
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5148
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9492
## Specificity : 0.5073
## Pos Pred Value : 0.8617
## Neg Pred Value : 0.7553
## Prevalence : 0.7638
## Detection Rate : 0.7250
## Detection Prevalence : 0.8413
## Balanced Accuracy : 0.7282
##
## 'Positive' Class : Under_50K
##
CrossTable(pred_rpart, test$IncomeCLASS)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 16281
##
##
## | test$IncomeCLASS
## pred_rpart | Under_50K | More_50K | Row Total |
## -------------|-----------|-----------|-----------|
## Under_50K | 11803 | 1895 | 13698 |
## | 171.840 | 555.598 | |
## | 0.862 | 0.138 | 0.841 |
## | 0.949 | 0.493 | |
## | 0.725 | 0.116 | |
## -------------|-----------|-----------|-----------|
## More_50K | 632 | 1951 | 2583 |
## | 911.290 | 2946.410 | |
## | 0.245 | 0.755 | 0.159 |
## | 0.051 | 0.507 | |
## | 0.039 | 0.120 | |
## -------------|-----------|-----------|-----------|
## Column Total | 12435 | 3846 | 16281 |
## | 0.764 | 0.236 | |
## -------------|-----------|-----------|-----------|
##
##
summary(pred_rpart)
## Under_50K More_50K
## 13698 2583
model_part$finalModel
## n= 32561
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 32561 7841 Under_50K (0.75919044 0.24080956)
## 2) Marital_StatusMarried-civ-spouse< 0.5 17585 1149 Under_50K (0.93466022 0.06533978)
## 4) Capital_Gain< 7073.5 17274 849 Under_50K (0.95085099 0.04914901) *
## 5) Capital_Gain>=7073.5 311 11 More_50K (0.03536977 0.96463023) *
## 3) Marital_StatusMarried-civ-spouse>=0.5 14976 6692 Under_50K (0.55315171 0.44684829)
## 6) Education_Num< 12.5 10507 3478 Under_50K (0.66898258 0.33101742)
## 12) Capital_Gain< 5095.5 9979 2961 Under_50K (0.70327688 0.29672312) *
## 13) Capital_Gain>=5095.5 528 11 More_50K (0.02083333 0.97916667) *
## 7) Education_Num>=12.5 4469 1255 More_50K (0.28082345 0.71917655) *
cm_rpart <- data.frame(confusionMatrix(pred_rpart, test$IncomeCLASS)[3])
kable(cm_rpart)
| overall | |
|---|---|
| Accuracy | 0.8447884 |
| Kappa | 0.5148460 |
| AccuracyLower | 0.8391340 |
| AccuracyUpper | 0.8503193 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
rpart.plot(model_part$finalModel)
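The printed tree reduces to four terminal rules; a hand-translation into a plain function (for illustration only, with hypothetical argument names mirroring the split variables) makes them explicit:

```r
# Hand-translation of the fitted rpart tree's four terminal rules
predict_tree <- function(married_civ_spouse, capital_gain, education_num) {
  if (!married_civ_spouse) {
    if (capital_gain < 7073.5) "Under_50K" else "More_50K"
  } else if (education_num < 12.5) {
    if (capital_gain < 5095.5) "Under_50K" else "More_50K"
  } else {
    "More_50K"
  }
}
predict_tree(FALSE, 0, 9)     # "Under_50K"
predict_tree(TRUE, 6000, 10)  # "More_50K" (capital gain above 5095.5)
```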
model.rf <- randomForest(IncomeCLASS~., data = train, ntree = 750, importance = TRUE)
pred.rf <- predict(model.rf, test)
summary(pred.rf)
plot(model.rf)
confusionMatrix(pred.rf, test$IncomeCLASS)  # predictions first, reference second
cm.rf <- data.frame(confusionMatrix(pred.rf, test$IncomeCLASS)[3])
kable(cm.rf)
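Since the forest was grown with importance = TRUE, the per-feature importance can also be inspected. A self-contained sketch on the built-in iris data (not the Adult data) shows the calls; the same `importance()`/`varImpPlot()` would apply to model.rf:

```r
library(randomForest)

set.seed(1)
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 100, importance = TRUE)
importance(rf_demo)  # MeanDecreaseAccuracy and MeanDecreaseGini per feature
varImpPlot(rf_demo)  # dot plot of the same importance measures
```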
train_n <- train %>% sapply(as.numeric)
train_n <- as.data.frame(train_n)
model_neural <- neuralnet(IncomeCLASS ~ ., data = train_n, hidden = 1, rep = 5, act.fct = "logistic", linear.output = FALSE)
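Note that the factors were coerced to arbitrary numeric codes, and neural networks are also sensitive to feature scale. A common preprocessing step (not used in the original pipeline) is min-max scaling to [0, 1]; a minimal sketch with a hypothetical `scale01` helper:

```r
# Min-max scale each numeric column to [0, 1] before fitting a neural net
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))
demo <- data.frame(Age = c(25, 40, 60), Education_Num = c(9, 13, 16))
demo_scaled <- as.data.frame(lapply(demo, scale01))
demo_scaled$Age  # 0.0000000 0.4285714 1.0000000
```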
plot(model_neural, rep = "best")
summary(model_neural)
## Length Class Mode
## call 7 -none- call
## response 32561 -none- numeric
## covariate 423293 -none- numeric
## model.list 2 -none- list
## err.fct 1 -none- function
## act.fct 1 -none- function
## linear.output 1 -none- logical
## data 14 data.frame list
## exclude 0 -none- NULL
## net.result 5 -none- list
## weights 5 -none- list
## generalized.weights 5 -none- list
## startweights 5 -none- list
## result.matrix 95 -none- numeric
test_n <- as.matrix(sapply(test, as.numeric))
test_n <- as.data.frame(test_n)
model_pred <- compute(model_neural, test_n)
pr.nn <- model_pred$net.result
original_values <- test_n[, 14]          # numeric-coded income class in the test set
pr.nn_2 <- max.col(pr.nn)                # predicted class = column with highest activation
outs <- mean(pr.nn_2 == original_values) # proportion of matching predictions
outs
## [1] 0.7637737
kable(cm.rf)

| overall | |
|---|---|
| Accuracy | 0.8643818 |
| Kappa | 0.6007945 |
| AccuracyLower | 0.8590268 |
| AccuracyUpper | 0.8696063 |
| AccuracyNull | 0.7637737 |
| AccuracyPValue | 0.0000000 |
| McnemarPValue | 0.0000000 |
cor(max.col(pr.nn), test_n[,14])  # NA: with a single output column, max.col is constant
## Warning in cor(max.col(pr.nn), test_n[, 14]): the standard deviation is
## zero
## [1] NA
Accuracy <- mean(pr.nn_2 == original_values)
Accuracy
## [1] 0.7637737
The analysis confirmed (and quantified) what is considered common sense:
Age, education, occupation, and marital status (or relationship type) are good predictors of income above a given threshold.
About 46% of the people are in a relationship labelled "Husband" or "Wife"; this group is further split by education level, and nearly 14% of all individuals, those holding a Bachelors, Prof-school, Masters, or Doctorate degree, earn above $50K.
The other education levels predominantly earn below $50K, with just 2% earning above $50K; those also have capital gains greater than $5,096.
Among the other relationship types, only 1% have income above $50K, and they have capital gains greater than $7,074.
Looking at education against the number of people earning above $50K, separated by work class, we find that Bachelors graduates working in private companies form the largest group earning above $50K.
Looking at average hours per week by gender, separated by work class, we find that males typically work more hours per week on average across all work classes.
The third sheet shows the relationship between marital status and income level, separated by work class: the majority of people married to a civilian spouse have an income greater than $50K, most of them in the private sector.
The fourth worksheet shows the impact of occupation, capital gain, and capital loss on income level, with work class detail as well; it is a comprehensive visualization across four parameters. We find that executives at the managerial level include the most people with income greater than $50K, while Professional Specialty shows the largest capital gains.
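Collecting the test-set metrics reported earlier for each model into one frame makes the final comparison explicit (the 0.8644 figure corresponds to the random-forest metrics table; the neural network's kappa was not reported, so it is left as NA):

```r
# Test-set metrics as reported earlier in this document
results <- data.frame(
  Model    = c("Naive Bayes", "rpart tree", "Random forest", "neuralnet"),
  Accuracy = c(0.8256, 0.8448, 0.8644, 0.7638),
  Kappa    = c(0.4638, 0.5148, 0.6008, NA)
)
results[order(-results$Accuracy), ]  # random forest ranks first
```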